1. INTRODUCTION

Automatic speech recognition (ASR) is an important topic in speech processing. ASR is a technology
that allows an electronic platform such as a smartphone or a computer to identify words spoken by humans [1].
Speech is a powerful and natural tool for communication, so a speech recognition system makes
the interaction between a human and a machine simpler and more fluid [2]. In recent years, researchers
have devoted considerable effort to biometric security technology based on speaker recognition, in order to make
communication between humans and machines more natural [3]. Speaker recognition can
be divided into identification and verification. Speaker identification is the process of
automatically recognizing who is speaking based on individual information contained in the speech wave. Speaker
verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.
Verification lets speakers prove their identity to gain access to services such as security control, telephone shopping,
voice mail, database access and remote access to computers [4].

An ASR system involves two phases: the training phase and the testing phase. In the training phase,
the parameters of the classification model are estimated from a large amount of training data. Features
are extracted from all speech signals using feature extraction techniques such as MFCC, LPC,
LDA and RASTA [1, 3, 5]. These features take the form of vectors and are stored in reference models, in particular
the acoustic model, which the classification algorithm uses in the testing phase to characterize each
word [5].

The most widely used speech features are the Mel-frequency cepstral coefficients
(MFCC), perceptual linear prediction (PLP) and linear predictive coding (LPC) coefficients [6].
The MFCCs are the best known and are less susceptible to speaker-dependent variations, so
we use them in this study [7]. Many matching
techniques are used in speech and speaker recognition, such as dynamic time warping (DTW), hidden Markov
models (HMM), which are very frequently used in speech recognition [8, 9], and artificial neural networks (ANN).
The Gaussian mixture model (GMM) and vector quantization (VQ) are generative models used for creating
a speaker model. The Gaussian mixture model is widely used for modeling in speaker identification systems [10].
In this paper, the VQ method is employed and compared with the GMM model. The VQ method
is characterized by its ease of implementation and its high accuracy. For vector quantization, the LBG
(Linde, Buzo, and Gray) algorithm and the k-means algorithm are the most familiar algorithms [11-13].

In voice applications, speech is degraded by interference from background noise. Consequently,
we cannot tell through direct observation whether the signal contains valid information [14]. In this
paper, the performance of a speaker identification system designed for clean speech is further
investigated by adding noise, specifically additive white Gaussian noise (AWGN), to the clean speakers'
utterances in the training and testing phases [15, 16].

Much work has been done in speech processing for many languages, covering speaker-independent systems and
automatic speech recognition for isolated words or continuous speech, such as [17-19]. Existing speech
recognition systems work well for European languages like English [20-24]. Research on
Arabic speech recognition is still limited, especially for continuous speech recognition in noisy
environments [25-29], and most speech recognition applications are available in English, like
the works presented in [30-34]. This is despite Arabic being the fourth most widely spoken language in the world,
which is why our research focuses on Arabic speech recognition in noisy environments. To our knowledge, it is
the first work to apply the methods mentioned above to this task.

Arabic is the fourth most widely spoken language and the liturgical language of nearly 1.6 billion Muslims.
It is spoken by the majority of people in the Middle East and North Africa, and it
has many different dialects. There is little work on Arabic speaker and speech recognition [35-38].
Table 1 presents a literature review of speech recognition research using MFCC, GMM and VQ
techniques for Arabic and other languages.

— Background: Speech recognition and speaker identification systems have a wide range of applications.
Many works have been carried out in this area using multiple techniques. MFCC, HMM, GMM and VQ are
the most prominent methods.

— Objectives: The aim of this paper is to implement a small-scale Arabic digit speech recognition system for
noisy environments, based on MFCC for feature extraction and a hybrid GMM-VQ model for training
and classification. The system recognizes spoken digit inputs and compares
an unknown speaker's speech against a database of N known speakers. The best match is returned as
the identified speaker, together with the spoken digit.

— The problem: The Arabic language is among the most spoken languages in the world, with around 300
million native speakers. However, compared to other languages, research on Arabic speech
recognition is still scarce, especially in noisy environments. The presence of background noise and the diversity
of Arabic dialects are both challenges for Arabic speech recognition. Moroccan Arabic is particularly
difficult to study because of its orthographic rules and because its accents and
vocabulary vary across the regions of Morocco.

— The proposed solution: To contribute to the development of Arabic speech recognition systems, we built two
systems in one: the first for speaker identification and the second for recognition of Arabic spoken digits with AWGN
background noise. To this end, we propose a hybrid GMM-VQ model with MFCC as the feature
extraction technique. VQ is used for training and for speaker identification, and the GMM
for recognition. The efficacy of the proposed method is assessed through different experiments
and compared with earlier work.

The database for this work was built from 100 male and female speakers. The vocabulary consists
of 10 words representing the Arabic spoken digits from 0 to 9. The MFCC technique is used to extract
the features, and the GMM, VQ and GMM-VQ models are used for recognition. The system is tested using
test data spoken by 15 speakers and achieves an overall word accuracy of 98.33% in clean conditions using
GMM+VQ.

The rest of this paper is organized as follows. Section 2 describes the system architecture in more
depth. The experimental results are presented in Section 3, followed by a discussion in Section 4. Finally,
Section 5 gives the conclusion and future work.

Table 1. Summary of a limited literature review of speech recognition studies in Arabic and other languages

Author | Language | Year | Feature extraction | Method | Recognition rate (%)
Giorgio Biagetti et al. [4] | English (TIMIT) | 2017 | Karhunen-Loève transform (DKLT) | EM, GMM | 97.70% (noisy conditions)
Mohit Dua et al. [5] | Hindi | 2018 | MFCC, GFCC, BFCC | HMM-GMM | MFCC 65.25%, GFCC 75.02%, BFCC 75.56%
Bhadragiri Jagan Mohan and Ramesh Babu N. [7] | English | 2014 | MFCC | DTW | satisfactory
T. K. Das et al. [8] | English | 2016 | MFCC | VQ, HMM | 90%
D. Nagajyothi and P. Siddaiah [11] | Speaker voice, English | 2017 | MFCC | VQ, LBG | high accuracy
Arnav Gupta and Harshit Gupta [12] | Speaker voice, English | 2013 | MFCC | VQ | 89%
Ankur Maurya et al. [13] | Hindi | 2017 | MFCC | VQ, GMM | 85.49% using MFCC-VQ, 94.12% using MFCC-GMM
Veena and Mathew [14] | English (TIMIT) | 2015 | MFCC | SVM-GMM | 95% in clean, 90% in noisy
Musab Al-Kaltakchi et al. [15] | English (TIMIT) | 2017 | PNCC, MFCC | GMM-UBM | 95% in clean, 75.83% at SNR (0-30) dB
S. B. Dhonde and S. M. Jagade [17] | English (TIMIT) | 2016 | MFCC | VQ, GMM | 98.4% with MFCC-VQ, 99.2% with MFCC-GMM
U. G. Patil et al. [18] | Hindi | 2016 | MFCC | VQ-GMM | 94.31%
S. Karpagavalli et al. [21] | Tamil | 2012 | MFCC | HMM | 92%
Rafik Djemili et al. [22] | English (IViE corpus) | 2012 | MFCC | GMM, MLP, VQ, LVQ | 96.4% with GMM, MLP, VQ; 94.6% with LVQ
Bidhan Barai et al. [23] | English | 2017 | MFCC, GFCC | VQ/GMM | 100% in clean, 90% in noisy
Chen Wang et al. [24] | English | 2008 | MFCC | VQ-GMM | 93.1%
N. Hammami and M. Bedda [25] | Arabic | 2010 | MFCC | VQ-MWST | 93.12%
Awais Mahmood et al. [26] | Arabic | 2014 | MFCC, MDLF | GMM | 96.89%
M. Alsulaiman et al. [28] | Arabic | 2016 | MFCC, MDLF and MDLF-MA | GMM | 94%
Mohamed Khelifa et al. [29] | Arabic | 2017 | MFCC | HMM/GMM | Between 94% and 97%
Azzedine Touazi and Mohamed Debyeche [35] | Arabic | 2017 | MFCC | HMM | 99.89% in clean, 95.94% in multi-condition
Anissa Imen Amrous et al. [38] | Arabic | 2011 | MFCC | HMM | 93.91% in clean, 32.33% in pink noise at 5 dB


2. THE SYSTEM ARCHITECTURE
2.1. Arabic digits speech recognition system

The first ten Arabic digits are: “Siffer”, “Wahed”, “Ithnani”, “Thalatha”, “Arbaa”, “Khamsa”, “Sitta”,
“Sabaa”, “Thamanya” and “Tisaa”. Spoken Arabic digits are useful in many applications such as
telephone dialing systems, banking systems and airline reservations. These ten digits are polysyllabic words,
except “zero/siffer”, which is a monosyllabic word, as shown in Table 2. The syllable types in the Arabic language
are CV, CVC and CVCC, where V indicates a (long or short) vowel and
C indicates a consonant. Arabic utterances can only start with a consonant [37].


Table 2. Arabic digits

Digit | Arabic writing | Pronunciation | Syllables | Number of syllables
0 | صفر | sefr | CVCC | 1
1 | واحد | wa-hed | CV-CVC | 2
2 | اثنين | aath-nayn | CVC-CVC | 2
3 | ثلاثة | tha-la-thah | CV-CV-CVC | 3
4 | أربعة | aar-baah | CVC-CV-CVC | 3
5 | خمسة | kham-sah | CVC-CVC | 2
6 | ستة | set-tah | CVC-CVC | 2
7 | سبعة | sub-aah | CVC-CVC | 2
8 | ثمانية | tha-ma-nyeh | CV-CV-CVC | 3
9 | تسعة | tes-ah | CVC-CVC | 2


2.2. The system architecture 

An automatic speaker recognition (ASR) system identifies a speaker
by analyzing the spectral shape of his or her voice signal. Speaker recognition, as illustrated in Figure 1, is
the process of identifying a person from a voice recorded with a microphone by comparing it with
voices stored during training. The system is split into two phases: the training phase
and the testing phase. In both phases, speaker identification consists of four steps: voice
recording, feature extraction, pattern matching and decision (recognized/not recognized).





[Block diagram; output block: identified speaker ID and digit word]

Figure 1. General structure of speech and speaker recognition system


The system is divided into two parts, speaker identification and spoken digit recognition:
- Spoken digit recognition: recognize the word among the 10 Arabic digit words.
- Speaker identification: recognize the speaker of a particular spoken word.


2.2.1. Training phase 

The speakers' reference database stores the speaker IDs along with their audio recordings, so that the system
can build a reference model for each speaker. For the training phase of spoken digit recognition, we
recorded a database beforehand and converted it into acoustic vectors using Mel-frequency cepstrum
coefficients (MFCC).


2.2.2. Testing phase 

In the testing phase, the system checks whether the speaker's input speech is similar to what is stored
in the reference. The system can therefore identify both the person who is speaking and the digit being said.
During this phase, 450 voice samples are recorded by 10 male and 5 female speakers chosen from our
database. This clean test data is then mixed with additive white Gaussian noise (AWGN) at different SNR levels
(5, 10, 15 and 20 dB). The codebook vectors are built from a specific
speaker's voice using the proposed VQ-GMM approach. They are then compared with the reference models obtained in the training phase.


2.3. Mel-frequency cepstrum coefficients (MFCC) 

Mel-frequency cepstrum coefficients (MFCC) are popular features extracted from speech signals for
use in recognition tasks. MFCCs are based on a perceptually scaled frequency axis, which allows for a better
representation of the speech. The mel-scale value for a given frequency
f (in Hz) is given by (1).


$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$
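
As a minimal sketch of the mel mapping in (1), the following Python snippet converts between Hz and mels and places filter centres evenly on the mel axis. The number of filters (20) is an assumption chosen for illustration only; the upper limit of 4000 Hz is simply the Nyquist frequency for the 8000 Hz sampling rate used in this work. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels using Eq. (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of Eq. (1): convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters, f_min, f_max):
    """Centre frequencies (Hz) of filters spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Assumed setup: 20 filters between 0 Hz and the 4000 Hz Nyquist limit.
centers = mel_filter_centers(n_filters=20, f_min=0.0, f_max=4000.0)
for c in centers:
    print(round(c, 1))
```

The centres are dense at low frequencies and sparse at high frequencies, which is the perceptual warping that the mel scale is meant to provide.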


Figure 2 shows the general block diagram for the extraction of MFCC feature vectors. Five basic operations
are carried out on the speech signal to obtain the cepstral coefficients. The acoustic features used in this evaluation are
composed of 39 parameters built on 13 MFCCs.

Figure 2. Detailed MFCC process


2.4. Vector quantization 

Vector quantization (VQ) is a classical and one of the most frequently used pattern-matching techniques [2].
We use the VQ approach in this paper because of its ease of implementation and its high accuracy.
The technique extracts a small number of representative feature vectors as an efficient means of
characterizing speaker-specific features. The training features are clustered by the VQ method to
create a codebook for each speaker. In the recognition phase, the system measures the distance between a
speaker's test data and the codebook of each speaker, and derives the recognition result from it [39-41].

Figure 3 shows an illustrative diagram of this recognition process. One speaker can be discriminated
from another based on the location of the centroids, which are obtained in the training phase with the LBG clustering
algorithm [42]. Figure 3 is limited to two speakers and their acoustic vectors. The yellow circles
are the acoustic vectors of speaker 1, while the blue circles are those of speaker 2. A speaker-specific
VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors. The resulting
codewords (centroids) are shown in Figure 3 as black circles for speaker 1 and red circles for speaker 2.
The distance from a vector to the closest codeword of a codebook is called the VQ distortion; it
is calculated in the testing phase of the speaker recognition system.
The speaker whose codebook gives the minimum VQ distortion is selected as the identified speaker [42]. We used
the LBG algorithm to build a codebook from a set of training vectors for this purpose [42].
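
The codebook training and minimum-distortion decision described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work: the two-dimensional synthetic vectors stand in for MFCC frames, the codebook size of 2, the additive split perturbation `eps` and the fixed number of refinement iterations are all assumptions, and the function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean (centroid) of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lbg_codebook(vectors, size, eps=0.01, n_iter=20):
    """Grow a codebook by binary splitting (size should be a power of two)."""
    codebook = [mean_vector(vectors)]
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(n_iter):
            clusters = [[] for _ in codebook]
            for v in vectors:
                idx = min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
                clusters[idx].append(v)
            # Keep the old codeword when a cluster ends up empty.
            codebook = [mean_vector(c) if c else cw for c, cw in zip(clusters, codebook)]
    return codebook


def avg_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, cw) for cw in codebook) for v in vectors) / len(vectors)


def identify(test_vectors, codebooks):
    """Return the speaker whose codebook gives the minimum VQ distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(test_vectors, codebooks[spk]))


rng = random.Random(0)


def cloud(center, n=50, spread=0.3):
    """Synthetic stand-in for feature vectors scattered around a centre."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]


train = {
    "speaker_1": cloud([0.0, 0.0]) + cloud([2.0, 2.0]),
    "speaker_2": cloud([6.0, 6.0]) + cloud([8.0, 8.0]),
}
codebooks = {spk: lbg_codebook(vecs, size=2) for spk, vecs in train.items()}
test_utterance = cloud([2.0, 2.0], n=20)
print(identify(test_utterance, codebooks))
```

In a real system each vector would be a frame of MFCC features and the codebooks would be considerably larger; the decision rule stays the same.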


[Diagram: samples and centroids of Speaker #1 and Speaker #2, with the VQ distortion marked]

Figure 3. Vector quantization codebook


2.5. Gaussian mixture model 

A Gaussian mixture model (GMM) is a parametric probability
density function represented as a weighted sum of Gaussian component densities. The Gaussian mixture model is
closely related to the vector quantization model and can be seen as an extension of it in which the clusters overlap.
The parameters of the model are collectively denoted by $\lambda$, as given in (2). Each speaker is represented by
a GMM and is referred to by his or her model $\lambda$.


$$\lambda = \{w_i, \mu_i, \Sigma_i\}, \quad i = 1, \dots, M \tag{2}$$


where $w_i$ is the mixture weight, $\mu_i$ the mean vector and $\Sigma_i$ the covariance matrix of the i-th normally
distributed component. In this method, the distribution of the feature vector x is modeled explicitly using a mixture of M
Gaussians. The GMM parameters are estimated from training data using the iterative expectation-maximization
(EM) algorithm. These parameters are computed in the training phase to create a speaker model. In the testing
phase, the speaker model with the highest a posteriori probability for the features of an unknown voice is selected
as the identity of that unknown speaker [14, 17, 43, 44]. The mixture density is the weighted sum of
the M component Gaussian densities:


$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i) \tag{3}$$
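
To make the scoring step concrete, the sketch below evaluates the mixture density of (3) in the log domain and selects the speaker model with the highest total log-likelihood. Several things here are assumptions for illustration: the covariance matrices are taken to be diagonal, all speakers are given equal prior probability (so that the maximum a posteriori decision reduces to maximum likelihood), and the model parameters are set by hand rather than estimated with EM. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, model):
    """log p(x | lambda) for a model given as a list of (weight, mean, variances)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in model]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def identify(frames, models):
    """Pick the speaker model with the highest total log-likelihood."""
    return max(models, key=lambda spk: sum(log_gmm(x, models[spk]) for x in frames))


# Hand-set two-component models (hypothetical values, not trained with EM).
models = {
    "speaker_1": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "speaker_2": [(0.3, [6.0, 6.0], [1.0, 1.0]), (0.7, [8.0, 8.0], [1.0, 1.0])],
}
frames = [[0.1, -0.2], [1.9, 2.2], [0.4, 0.3]]
print(identify(frames, models))
```

Summing log-likelihoods over frames treats the feature vectors as independent, which is the usual simplification in GMM-based speaker identification.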


Figure 4 illustrates the steps of Gaussian mixture modelling of the speaker database.

[Block diagram; recoverable blocks: training speaker, GMM estimation, identified speaker]

Figure 4. GMM block diagram of speaker identification


2.6. Signal to noise ratio 
The signal-to-noise ratio (SNR) measures the signal strength relative to the background
noise level. The SNR is expressed in decibels (dB) as:


$$\mathrm{SNR(dB)} = 10 \log_{10}\left(\frac{P_{speech}}{P_{noise}}\right) \tag{4}$$


where $P_{speech} = \frac{1}{L}\sum_{t=1}^{L} x^2(t)$ and $P_{noise} = \frac{1}{L}\sum_{t=1}^{L} n^2(t)$ denote the power of
the speech signal and of the noise, respectively, over L samples. The clean
speech signal x(t) is degraded by an additive noise signal n(t), generated by:


$$n(t) = \mathrm{awgn}(x, snr) \tag{5}$$

So, the observed noisy speech y(t) can be expressed as:

$$y(t) = x(t) + n(t) \tag{6}$$
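
The relation between (4) and (6) can be illustrated with a short Python sketch that scales white Gaussian noise to a target SNR and then measures the SNR actually obtained. This is not the MATLAB routine used in this work: the 440 Hz test tone stands in for a speech signal, and the function names are hypothetical. The noise power is derived from the measured power of the clean signal.

```python
import math
import random


def power(signal):
    """Average power of a discrete signal (mean of squared samples)."""
    return sum(s * s for s in signal) / len(signal)


def add_awgn(clean, target_snr_db, rng):
    """Return (noisy, noise) with white Gaussian noise scaled to the target SNR."""
    noise_power = power(clean) / (10.0 ** (target_snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    noise = [rng.gauss(0.0, sigma) for _ in clean]
    noisy = [x + n for x, n in zip(clean, noise)]  # Eq. (6)
    return noisy, noise


def measured_snr_db(clean, noise):
    """SNR in dB following Eq. (4)."""
    return 10.0 * math.log10(power(clean) / power(noise))


rng = random.Random(0)
fs = 8000  # sampling frequency used in this work
clean = [math.sin(2.0 * math.pi * 440.0 * t / fs) for t in range(fs)]  # stand-in signal

for target in (5, 10, 15, 20):
    noisy, noise = add_awgn(clean, target, rng)
    print(target, round(measured_snr_db(clean, noise), 2))
```

Because the noise is random, the measured SNR only approximates the target; the gap shrinks as the signal gets longer.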


An example of the corruption of the clean signal by additive white Gaussian noise (AWGN) is given in
Figure 5.

[Two panels: magnitude spectrum for clean speech “wahid”; magnitude spectrum for noisy speech “wahid”]

Figure 5. Magnitude spectrum of the original clean signal and of
the noisy signal of the word “/wahid/” at SNR = 5 dB


3. EXPERIMENTAL RESULTS 
3.1. Corpus preparation 

In this paper, we consider two categories of data: clean data and
noisy data.
— Clean database

We recorded a database, ARBDIGITS, of 100 Moroccan Arabic speakers, including males and
females, pronouncing the digits from ‘siffer’ (zero) to ‘tisaa’ (nine); it is used for training. Each speaker
utters each word several times in isolation. The voices were recorded at a sampling frequency of 8000 Hz.
We used the noise removal tool available in the Audacity software to remove background noise from the original
recordings and obtain the clean data.
— Noisy database

For the data used in the testing phase, noise is added with the MATLAB function for adding
white Gaussian noise (AWGN) to the clean ARBDIGITS database at signal-to-noise ratios
(SNR) from 5 dB to 20 dB. The testing samples come from 10 male and 5 female speakers
aged between 15 and 40 years. They recorded the first ten Arabic digits, each repeated three times. All of the
analysis was carried out for a single noise type (AWGN) at SNR levels from 5 dB to 20 dB.
Table 3 reports more technical details about the ARBDIGITS database used in the experimental evaluation.


Table 3. Information and conditions of the ARBDIGITS corpus used

Process | Description
Participants | 100 speakers (70 male, 30 female)
Environment | Reverberant; two-channel stereo mode
Words | 10 Arabic spoken digits
Training set | 85 speakers
Testing set | 15 speakers
Number of clean words selected | 10 x 3 x 100 = 3000
Total number of noisy words | 3000 x 1 x 4 = 12000
Noise type used | AWGN
SNR levels used | Clean, 5 dB, 10 dB, 15 dB, 20 dB
Total size of database | 1 GB
Sampling frequency, fs | 8000 Hz
Software used for mixing | MATLAB R2016b (trial version)

3.2. Results 
First, the performance of the system is evaluated by the recognition rate, calculated as:


$$\text{Rec Rate} = \frac{\text{Number of successfully detected words}}{\text{Total number of words in test dataset}} \times 100 \tag{7}$$
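
A minimal sketch of the recognition rate in (7), applied to a made-up list of ten test words in which one digit is misrecognized. The data and the function name are hypothetical and serve only to show the computation.

```python
def recognition_rate(predicted, reference):
    """Percentage of test words whose predicted label matches the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


reference = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
predicted = [0, 1, 2, 3, 4, 5, 7, 7, 8, 9]  # digit 6 misrecognized as 7
print(recognition_rate(predicted, reference))  # 90.0
```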
The average recognition rates obtained for the ten Arabic digits in a clean environment and in noisy
environments, for different SNR values, are given in Table 4, and the corresponding chart is shown in Figure 6.
From the results in Table 4, we can conclude that the effect of the noise is not significant
when the SNR reaches 20 dB or more: in this case we obtain approximately the same average recognition rate
as with clean data. In the testing phase for speaker identification, the spoken samples are recorded by 15
speakers (10 male and 5 female) chosen from our ARBDIGITS database of 100
Moroccan Arabic speakers. The speech is recorded at an 8 kHz sampling frequency using the audiorecorder function of
MATLAB 2016 on a 64-bit Windows platform, with
a microphone capturing the male and female speech.

Testing was first carried out in clean conditions and then with AWGN noise added at
SNR levels of 5, 10, 15 and 20 dB. Note that the speech segment was degraded when SNR < 5 dB.
The speaker recognition percentages are given in Table 5, and the corresponding chart is shown in Figure 7.
The average recognition rates of the 15 speakers obtained with GMM training in a clean
environment and in noisy environments, for different SNR values, are given in Table 6, and the corresponding chart
is shown in Figure 8. Table 7 shows the overall recognition rates (%) for the speaker identification
system using the combination of the VQ and GMM algorithms; the corresponding bar chart is shown in Figure 9.


Table 4. Average recognition rate (%) using MFCC+VQ

Digit | Clean | 5 dB | 10 dB | 15 dB | 20 dB
0 | 82.37 | 45.17 | 67.27 | 79.66 | 80.33
1 | 80.11 | 42.34 | 65.17 | 75.11 | 79.28
2 | 81.54 | 44.06 | 62.66 | 78.09 | 80.05
3 | 79.08 | 40.47 | 60.87 | 75.33 | 78.00
4 | 85.52 | 41.52 | 68.12 | 77.18 | 84.11
5 | 80.17 | 38.78 | 58.34 | 74.23 | 80.17
6 | 84.42 | 42.10 | 66.52 | 76.12 | 84.42
7 | 83.67 | 39.27 | 62.71 | 75.25 | 82.05
8 | 77.02 | 38.96 | 60.33 | 71.78 | 77.02
9 | 86.63 | 47.03 | 61.51 | 79.00 | 85.28
Average | 82.05 | 41.97 | 63.35 | 76.17 | 81.07
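
As a worked check, the per-condition averages of Table 4 and the overall noisy-condition average can be recomputed directly from the per-digit rates. The sketch below uses the values of Table 4; the variable names are hypothetical, and the noisy average is taken here as the unweighted mean of the four SNR columns.

```python
# Per-digit recognition rates (%) from Table 4: clean, 5 dB, 10 dB, 15 dB, 20 dB.
table4 = {
    0: [82.37, 45.17, 67.27, 79.66, 80.33],
    1: [80.11, 42.34, 65.17, 75.11, 79.28],
    2: [81.54, 44.06, 62.66, 78.09, 80.05],
    3: [79.08, 40.47, 60.87, 75.33, 78.00],
    4: [85.52, 41.52, 68.12, 77.18, 84.11],
    5: [80.17, 38.78, 58.34, 74.23, 80.17],
    6: [84.42, 42.10, 66.52, 76.12, 84.42],
    7: [83.67, 39.27, 62.71, 75.25, 82.05],
    8: [77.02, 38.96, 60.33, 71.78, 77.02],
    9: [86.63, 47.03, 61.51, 79.00, 85.28],
}
conditions = ["clean", "5 dB", "10 dB", "15 dB", "20 dB"]

rows = list(table4.values())
column_avg = {
    cond: sum(row[i] for row in rows) / len(rows)
    for i, cond in enumerate(conditions)
}
# Unweighted mean over the four noisy conditions.
noisy_avg = sum(column_avg[c] for c in conditions[1:]) / (len(conditions) - 1)

for cond in conditions:
    print(cond, round(column_avg[cond], 2))
print("noisy average", round(noisy_avg, 2))
```

The recomputed values agree with the Average row of the table to within rounding, and the noisy average comes out at about 65.64%.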


Table 5. Speaker identification rate (%) for testing speech in clean conditions and in AWGN noise for different SNRs

Figure 6. Arabic digit recognition success rates (%)
in clean conditions and in the presence of AWGN noise


Figure 7. Speaker identification results (%) in clean conditions
and for different SNR values


Table 6. Speaker identification rate (%) for testing speech in clean conditions and in
AWGN noise for different SNRs using GMM (method: MFCC+GMM)

Speakers | Clean | 5 dB | 10 dB | 15 dB | 20 dB
10 males | 96.12 | 70.26 | 88.33 | 92.66 | 95
5 females | 94.66 | 70 | 87.67 | 92.33 | 96.33


Table 7. Speaker identification rate (%) for testing speech in clean conditions
and with AWGN for different SNRs using GMM+VQ

Speakers | Clean | 5 dB | 10 dB | 15 dB | 20 dB
10 males | 98.33 | 88.66 | 90.26 | 95.12 | 98
5 females | 97.12 | 86.33 | 90.66 | 94 | 97
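
The summary figures quoted for GMM and GMM+VQ in Section 5 can be recovered from Tables 6 and 7 as unweighted means over the male and female rows. The sketch below assumes that this is how the averages were formed (the two groups are not weighted by their sizes of 10 and 5 speakers); the variable and function names are hypothetical.

```python
# Identification rates (%) from Tables 6 and 7: clean, 5 dB, 10 dB, 15 dB, 20 dB.
gmm = {
    "male": [96.12, 70.26, 88.33, 92.66, 95.0],
    "female": [94.66, 70.0, 87.67, 92.33, 96.33],
}
gmm_vq = {
    "male": [98.33, 88.66, 90.26, 95.12, 98.0],
    "female": [97.12, 86.33, 90.66, 94.0, 97.0],
}


def clean_and_noisy_average(table):
    """Unweighted means of the clean column and of all noisy entries."""
    clean = [row[0] for row in table.values()]
    noisy = [rate for row in table.values() for rate in row[1:]]
    return sum(clean) / len(clean), sum(noisy) / len(noisy)


gmm_clean, gmm_noisy = clean_and_noisy_average(gmm)
hybrid_clean, hybrid_noisy = clean_and_noisy_average(gmm_vq)
print("GMM:", gmm_clean, gmm_noisy)
print("GMM+VQ:", hybrid_clean, hybrid_noisy)
```

Under this assumption the means come out close to 95.39% and 86.57% for GMM, and close to 97.72% and 92.50% for GMM+VQ.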


[Bar charts; legend: 10 male, 5 female]

Figure 8. Speaker identification results (%) in clean conditions and at all SNR levels with AWGN noise
using GMM modeling

Figure 9. Speaker identification rate in clean and noisy conditions using GMM+VQ


The experiments were implemented in MATLAB R2016b. We built a graphical user interface (GUI),
illustrated in Figure 10, to simplify the testing process: the speaker can be tested directly with a new voice
recording or with a recording from the test base. During recording, the user can optionally add AWGN noise to his or her voice
and select the SNR level. The GUI supports recording and plotting a sound, recording
new test data and identifying the speaker ID.











[Screenshot of the GUI, including a matching result panel]

Figure 10. GUI Main MATLAB system


4. DISCUSSIONS 

In this work, for the experiments on clean data, the maximum performance obtained with
MFCC+GMM+VQ is 97.72%, with MFCC+GMM 95.39% and with MFCC+VQ 90.04%. For
the tests on noisy data, the maximum accuracy obtained with MFCC+GMM+VQ is 92.50%, with MFCC+GMM
86.57% and with MFCC+VQ 65.64%. The best performance is clearly obtained when
the three techniques are used together (MFCC+GMM+VQ). The results also show that the Arabic
digit speech recognition system and the speaker identification system perform well in both clean and noisy
environments using MFCC for feature extraction and vector quantization. For our proposed model,
the combination of the three methods MFCC+GMM+VQ, the accuracy obtained
is higher with both clean and noisy data than the accuracy obtained with the MFCC+GMM or MFCC+VQ
methods. We can also conclude that the effect of the noise is not significant when the SNR reaches 20 dB or more: in
this case we obtain approximately the same average recognition rate as with clean data.

Note that the speaker identification system performs better with the GMM method than with the VQ
method, and that the combination of both methods (GMM+VQ) gives a better speaker
identification rate than either individual model. We also observe an increase in the identification rate of 14%
when SNR = 15 dB with GMM as the modelling method, but this gain is reduced to 7% when SNR = 5 dB. The
difference between these two percentages can be explained by the effect of background noise
on the feature vectors and on the acoustic data: when the noise is very strong it can mask the data,
and the verification fails. The verification is then performed on only a small amount of useful
data. As for the Arabic spoken digits system using the VQ method, the parameters obtained after training the system
for digits 6 and 7 are too close, as shown in Figure 7. Therefore, if the digit is not spoken clearly during recognition,
the system falters. Digit 8 gives the lowest accuracy because its speech sample has the highest
amount of unvoiced speech signal and is therefore treated as unvoiced speech data.


5. CONCLUSION 

In this paper, we have presented an automatic system able to recognize both the speaker and the speech
using the MFCC, VQ and GMM techniques for Arabic digit words. The results show that the average accuracy
for speaker identification using VQ is 90.04% in a clean environment and 75.86% in noisy environments,
and for the Arabic digits recognition system 82.05% in a clean environment and 65.64% on average in noisy
environments. The average for speaker identification using GMM is 95.39% in a clean environment
and 86.57% in noisy environments. The average for the combination (GMM+VQ) is 97.72% in a clean
environment and 92.50% with added AWGN noise, so this combination gives a better identification rate than
the individual models. The results could be improved further by using other methods such as ANN, HMM or DNN
for classification. In future work, we will compare this method with other methods in order to find one
that improves robustness to other types of noise, test other techniques and increase the
vocabulary size.